The National Innovation Centre for Data runs projects with organisations to help them acquire new skills and innovate through data
🌠 Collaborators
📚 Theory
✋ What is natural language processing?
Natural language processing is a subfield of Artificial Intelligence (AI) concerned with developing systems that deal with natural language
✋ What is language?
Language is a structured system of communication containing the following elements:
a collection of principles describing how to create appropriate utterances (grammar)
a set of words relating to the world (vocabulary)
✋ What is natural language?
Natural language is any language occurring in a human community by a process of use, repetition, and change without conscious planning or premeditation
✋ What is NOT a natural language?
The following communication systems are not considered natural languages
constructed languages, including:
fictional languages, such as 👽Klingon
programming languages, such as 🐍Python
international auxiliary languages, such as 🎏Esperanto
non-human communication systems, such as 🐝bee dancing
In 1966, the ALPAC Report concluded that machine translation (MT) did not seem feasible
Phase 2: 1960s-1970s
Research focussed on building and querying knowledge bases
In 1961, the BASEBALL system was developed to answer questions about baseball
In 1964, the ELIZA system simulated a Rogerian psychotherapist
In 1970, the SHRDLU system simulated a robot which manipulated blocks on a table top with instructions given in English
Phase 3: 1970s-1990s
Initially, research continued to focus on rule-based systems, with an emphasis on syntactic and semantic analysis
But, by the late 1980s, researchers became more united in focussing on empiricism and probabilistic models (e.g. Hidden Markov Models)
Notable progress was made in practical tasks, such as speech recognition and automatic summarisation
A major trend focussing on evaluating system performance emerged
Phase 4: 1990s-2017
In 2003, Bengio and colleagues suggested that neural networks could be used to model natural language
In 2013, Word2vec introduced a novel way to represent words as dense, continuous-valued vectors in a relatively low-dimensional space, which was notably better than earlier word representations (one-hot encoding or bag-of-words)
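One benefit of dense vectors is that word similarity becomes a simple geometric computation. The sketch below uses hypothetical 3-dimensional vectors and values chosen purely for illustration (real word2vec embeddings are learned and typically 100-300-dimensional):

```python
# Toy cosine similarity between dense word vectors
# (hypothetical 3-dimensional vectors; real embeddings are learned, not hand-picked)
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

# Similar words point in similar directions; dissimilar words do not
print(round(cosine_similarity(vectors["king"], vectors["queen"]), 3))
print(round(cosine_similarity(vectors["king"], vectors["banana"]), 3))
```

With learned embeddings, the same computation recovers semantic neighbours ("king" closer to "queen" than to "banana"), which one-hot encodings cannot express since all one-hot vectors are equidistant.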
In 2018, ELMo introduced the concept of contextualised word embeddings
🚀 Phase 5: 2017-present
This NLP phase is characterised by finding the winning recipe for building a good language model:
\[
\boxed{
\begin{array}{c}
\textit{winning recipe} \\
= \\
\textbf{huge amounts of easy-to-acquire data} \\
\times \\
\textbf{a simple, high-throughput way to consume it}
\end{array}
}
\]
💡 Breakthrough 1: subword tokenisation and dense embeddings
Most language models require numerical inputs, and thus text needs to be pre-processed into the expected model format. Text pre-processing focuses on:
splitting the input text into chunks (tokens)
converting each token to an integer (token ids) via look-up tables
mapping token ids to dense, continuous-valued vectors (embeddings)
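The three steps above can be sketched with a toy vocabulary (the tokens, ids, and embedding values below are hypothetical, not a real model's):

```python
# Toy illustration of the tokenise -> ids -> embeddings pipeline
# (hypothetical vocabulary and embedding values, not a real model's)
vocab = {"nl": 0, "##p": 1, "is": 2, "fun": 3}   # token -> id look-up table
embedding_table = [                              # id -> 3-dim dense vector
    [0.1, -0.2, 0.3],
    [0.0, 0.5, -0.1],
    [0.4, 0.4, 0.2],
    [-0.3, 0.1, 0.6],
]

tokens = ["nl", "##p", "is", "fun"]              # step 1: split text into tokens
ids = [vocab[t] for t in tokens]                 # step 2: map tokens to ids
embeddings = [embedding_table[i] for i in ids]   # step 3: map ids to embeddings

print(ids)
print(embeddings[0])
```

A real tokenizer carries out the same look-ups, just with a learned subword vocabulary of tens of thousands of entries, as the DistilBERT example below shows.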
from transformers import AutoTokenizer

# Define input text and checkpoint
input_text = "NLP is the most interesting subfield of AI"
checkpoint = "distilbert-base-uncased"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the input text
input_tokens = tokenizer.tokenize(input_text)

# Convert tokens to IDs
input_ids = tokenizer.convert_tokens_to_ids(input_tokens)

# Display results
result_text = (
    f"Input Text: {input_text}\n"
    f"Tokenized Text: {input_tokens}\n"
    f"Token IDs: {input_ids}"
)
print(result_text)
Input Text: NLP is the most interesting subfield of AI
Tokenized Text: ['nl', '##p', 'is', 'the', 'most', 'interesting', 'sub', '##field', 'of', 'ai']
Token IDs: [17953, 2361, 2003, 1996, 2087, 5875, 4942, 3790, 1997, 9932]
import torch
from transformers import AutoTokenizer, AutoModel

input_text = "NLP is the most interesting subfield of AI"

# Initialize the tokenizer and model
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt")

# Get the input embeddings
input_embeddings = model.get_input_embeddings()
embeddings = input_embeddings(input_tokens["input_ids"])

# Display the actual embeddings and their shape
print(f"Shape: {embeddings.shape}\n"
      f"Embeddings: {embeddings}")
💡 Breakthrough 2: self-supervised learning
Self-supervised learning aims to reconstruct the input or predict missing parts of the input
Currently, state-of-the-art language models are pre-trained using self-supervised learning, usually via language modelling:
from transformers import pipeline

# Define input text
input_text = "The goal of this workshop is to <mask> the audience about the power of NLP"

# Initialise the pipeline for masked language modelling
model = pipeline("fill-mask")

# Get the results
results = model(input_text)

# Display results
for i in results:
    print(f"{i['sequence']}\t{round(i['score'], 3)}")
The goal of this workshop is to educate the audience about the power of NLP 0.64
The goal of this workshop is to teach the audience about the power of NLP 0.278
The goal of this workshop is to inform the audience about the power of NLP 0.043
The goal of this workshop is to remind the audience about the power of NLP 0.011
The goal of this workshop is to engage the audience about the power of NLP 0.005
✋ What does pre-training learn?
While language modelling appears simple, it is a very powerful technique to learn a wide range of things since input sequences can contain any type of information, for example:
# Example 1: Syntax
"The quick brown fox jumps <mask> the lazy dog"

# Example 2: Trivia / Knowledge
"Newcastle University is located in <mask>"

# Example 3: Sentiment
"I've never laughed so much during a trip to the cinema. The Barbie movie was <mask>"

# Example 4: Coreference
"Will Ferrell stole the show in this movie. <mask> is such a good actor"

# Example 5: Mathematics
"I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, <mask>"
✋ Where does the data come from?
The data used to pre-train language models is usually obtained from the internet
💡 Breakthrough 3: the Transformer architecture
attention: captures contextual relationships between words, allowing the model to weigh the importance of each word in the context of the entire input
positional encodings: retain information about the order of words in the input sequence, enabling effective handling of sequential information
parallel computation: Transformers can process the entire input sequence simultaneously, making them computationally efficient and enabling faster training and inference, unlike sequential networks (e.g. RNNs)
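The attention mechanism can be sketched in a few lines. Below is a minimal single-head, scaled dot-product attention in pure Python with hypothetical 2-dimensional token vectors (real models use learned, much larger projection matrices and many heads):

```python
# Minimal sketch of single-head scaled dot-product attention
# (hypothetical 2-dimensional token vectors; illustrative only)
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    output = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights over all positions
        # each output is a weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Three "token" vectors attending to each other (self-attention: Q = K = V)
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(Q, K, V)
print(result)
```

Because each output position is computed independently of the others, all positions can be processed in parallel, which is the source of the Transformer's training efficiency noted above.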
Recall that a language model is a model that learns to fill in the blanks
The definition of a large language model is rather fuzzy and there are different ways to define it, e.g. based on:
the number of parameters
the amount of data used to train the model
the ability to carry out a wide range of tasks
👷 Large language model workflow
Large language models could be viewed as a subset of foundation models, which describe the paradigm shift within AI from developing task-specific models trained on narrow data to developing multi-purpose models trained on broad data
😱 Large Language Models for everything: size matters
From the abstract of the GPT-3 paper (Brown et al., 2020): "Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora"
🔍 Scaling laws
In 2020, J. Kaplan and colleagues demonstrated that the performance of Large Language Models appears to improve with scaling:
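For model size, the scaling behaviour they report can be summarised as an approximate power law (notation follows the paper; treat the exact form as indicative, with data and compute not bottlenecked):

\[
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
\]

where \(L\) is the cross-entropy test loss, \(N\) the number of non-embedding parameters, and \(N_c\), \(\alpha_N\) empirically fitted constants; analogous power laws hold for dataset size and training compute.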
Breakthroughs 1-3 have had a notable impact on the performance of (generative) Large Language Models, which are now state-of-the-art for most benchmark NLP tasks
Moreover, in 2022, J. Wei and colleagues observed emergent abilities for many additional tasks
🔫 Sequence classification – sentiment analysis
from transformers import pipeline

input_text = "American Football is the best sport in the world"
sentiment_model = pipeline("sentiment-analysis")
output = sentiment_model(input_text)

# display results
f"Sentiment is: {output[0]['label']} with a score of {output[0]['score']}"
'Sentiment is: POSITIVE with a score of 0.9998629093170166'
🔫 Sequence classification – zero-shot classification with custom labels
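Zero-shot classification can be sketched as follows (the input text and candidate labels are illustrative assumptions; the pipeline downloads a default model from the Hugging Face Hub):

```python
from transformers import pipeline

# Hypothetical example: classify text against custom candidate labels
input_text = "Newcastle United won the match three nil"
classifier = pipeline("zero-shot-classification")
candidate_labels = ["sport", "politics", "science"]

output = classifier(input_text, candidate_labels=candidate_labels)

# display the labels ranked by score
for label, score in zip(output["labels"], output["scores"]):
    print(f"{label}: {round(score, 3)}")
```

No fine-tuning on these labels is needed: the labels are supplied at inference time, which is what makes the classification "zero-shot".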
🔫 Token classification – named entity recognition
from transformers import pipeline

input_text = "Newcastle University is located in Newcastle upon Tyne"
ner_model = pipeline("ner")
output = ner_model(input_text, aggregation_strategy="simple")

# display results
for i in output:
    print(f"{i['entity_group']}: {i['word']}")
ORG: Newcastle University
LOC: Newcastle upon Tyne
📌 Summarisation
Summarisation is the task of producing a shorter version of text while retaining its key points
There are two key variants of this task:
extractive summarisation: selecting spans of text from the input text to form a summary
abstractive summarisation: generating new text conditional on input text to form a summary
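The extractive variant can be illustrated with a toy frequency-based sentence scorer (an illustrative sketch only; real extractive summarisers use far more sophisticated scoring):

```python
# Toy extractive summariser: score sentences by word frequency, keep the top one
# (illustrative sketch, not a production method)
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(text.lower().split())
    # rank sentences by the total frequency of the words they contain
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.lower().split()),
                    reverse=True)
    return ". ".join(scored[:n_sentences]) + "."

text = ("American football is a team sport. "
        "The team with the most points wins. "
        "Points are scored by advancing the ball.")
print(extractive_summary(text))
```

Note that the output is always a verbatim span of the input; abstractive summarisation, shown in the pipeline example below, instead generates new text.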
🏈 Summarisation – examples
Consider the following input text:
# define input text
input_text = """American football (referred to simply as football in the United States and Canada), also known as gridiron, is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end. The offense, the team with possession of the oval-shaped football, attempts to advance down the field by running with the ball or passing it, while the defense, the team without possession of the ball, aims to stop the offense's advance and to take control of the ball for themselves. The offense must advance at least ten yards in four downs or plays; if they fail, they turn over the football to the defense, but if they succeed, they are given a new set of four downs to continue the drive. Points are scored primarily by advancing the ball into the opposing team's end zone for a touchdown or kicking the ball through the opponent's goalposts for a field goal. The team with the most points at the end of a game wins. American football evolved in the United States, originating from the sports of soccer and rugby. The first American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time. A set of rule changes drawn up from 1880 onward by Walter Camp, the "Father of American Football", established the snap, the line of scrimmage, eleven-player teams, and the concept of downs. Later rule changes legalized the forward pass, created the neutral zone and specified the size and shape of the football. The sport is closely related to Canadian football, which evolved in parallel with and at the same time as the American game, although its rules were developed independently from those of Camp. Most of the features that distinguish American football from rugby and soccer are also present in Canadian football. The two sports are considered the primary variants of gridiron football. American football is the most popular sport in the United States in terms of broadcast viewership audience. The most popular forms of the game are professional and college football, with the other major levels being high-school and youth football. As of 2012, nearly 1.1 million high-school athletes and 70,000 college athletes play the sport in the United States annually. The National Football League, the most popular American professional football league, has the highest average attendance of any professional sports league in the world. Its championship game, the Super Bowl, ranks among the most-watched club sporting events in the world. The league has an annual revenue of around US$15 billion, making it the most valuable sports league in the world. Other professional leagues exist worldwide, but the sport does not have the international popularity of other American sports like baseball or basketball."""
from transformers import pipeline

# define the default abstractive summarisation pipeline
summary_model = pipeline("summarization")
output = summary_model(input_text, min_length=10, max_length=100)

# display results
output[0]["summary_text"]
" American football is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end . The offense, the team with possession of the football, attempts to advance down the field by running with the ball or passing it, while the defense aims to stop the offense's advance . First American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time ."
❓ Question answering
Question answering is the task of retrieving an answer to a question posed in natural language
There are three key variants of this task:
extractive: the answer is a span of text from the context
open generative: the answer is free text based on the context
closed generative: the answer is free text; no context is provided
Note that the context can be either structured (e.g. tabular) or unstructured (e.g. textual)
❓ Question answering – extractive
Textual context:
from transformers import pipeline

context = "My name is Larry the Cat and I live at 10 Downing Street"
qa_model = pipeline("question-answering")
question = "Where do I live?"
output = qa_model(question=question, context=context)

# display results
f"Answer: {output['answer']}"
'Answer: 10 Downing Street'
❓ Question answering – extractive
Tabular context:
players titles
0 Patrick Mahomes 2
1 Tom Brady 7
2 Aaron Rodgers 1
3 Brock Purdy 0
from transformers import pipeline
import pandas as pd

# prepare table + question
dictionary = {"players": ["Patrick Mahomes", "Tom Brady", "Aaron Rodgers", "Brock Purdy"],
              "titles": ["2", "7", "1", "0"]}
context = pd.DataFrame.from_dict(dictionary)
question = "which player has the most Super Bowls?"

# pipeline model
tqa_model = pipeline("table-question-answering")
output = tqa_model(table=context, query=question)

# display results
f"Answer: {output['cells'][0]}"
'Answer: Tom Brady'
❓ Question answering – extractive
Image context:
from transformers import pipeline
from PIL import Image

checkpoint = "naver-clova-ix/donut-base-finetuned-docvqa"
image_qa = pipeline("document-question-answering", model=checkpoint)
question = "What is the total?"
image = Image.open("img/realistic-receipt.jpeg")
output = image_qa(image=image, question=question)

# display results
print(output)
❓ Question answering – open generative
from transformers import pipeline

# define the text2text-generation pipeline
text2text_generator = pipeline("text2text-generation")
question = "What is 42?"
context = "42 is the answer to life, the universe and everything"
output = text2text_generator(f"question: {question} context: {context}")

# display results
output[0]['generated_text']